Abstract: We present a principled approach for detecting 
overlapping temporal community structure in dynamic networks. 
Our method is based on the following framework: find the 
overlapping temporal community structure that maximizes a 
quality function associated with each snapshot of the network 
subject to a temporal smoothness constraint. A novel quality 
function and a smoothness constraint are proposed to handle 
overlaps, and a new convex relaxation is used to solve the result- 
ing combinatorial optimization problem. We provide theoretical 
guarantees as well as experimental results that reveal community 
structure in real and synthetic networks. Our main insight is 
that certain structures can be identified only when temporal 
correlation is considered and when communities are allowed 
to overlap. In general, discovering such overlapping temporal 
community structure can enhance our understanding of real- 
world complex networks by revealing the underlying stability 
behind their seemingly chaotic evolution. 

I. Introduction 

Communities are densely connected groups of nodes in a 
network. Community detection, which attempts to identify 
such communities, is a fundamental primitive in the analysis 
of complex networked systems that span multiple disciplines 
in network science such as biological networks, online social 
networks, epidemic networks, communication networks, etc. It 
serves as an important tool for understanding the underlying, 
often latent, structure of networks and has a wide range of 
applications: user profiling for online marketing, computer 
virus spread and spam detection, understanding protein-protein 
interactions, content caching, to name a few. The concept of 
communities has been generalized to overlapping communities,
which allow nodes to belong to multiple communities at the
same time. This has been shown to reveal the latent structure 
at multiple levels of hierarchy. 

Community detection in static networks has been studied 
extensively (see [1] for a comprehensive survey), but has
primarily been applied to social networks and information 
networks. Applications to communications networks have been 
few. Perhaps this is because communication networks (and in 
particular wireless networks) change at a much faster timescale 
than social and information networks, and the science of
community detection in time-varying networks is still developing.
In this paper, we hope to narrow this gap by providing efficient 
techniques for detecting communities in networks that vary 
over time while allowing such communities to be overlapping 
as we elaborate below. 

Temporal community detection aims to identify how
communities emerge, grow, combine, and decay over time.
Fig. 1. A schematic illustrating the various notions of community structure
in networks. Panel a) shows a typical community structure in an example
network. If one uses a quality function and methods that look for overlapping
community structure on the same network, one could find the structure shown
in panel b). When the network is time-varying, we illustrate the temporal
community structure by showing the communities that a node belongs to over
time. The top panels are single snapshots of the network evolution in the
bottom panels. The (non-overlapping) temporal community structure in panel
c) reveals how communities change with time. The overlapping temporal
community structure shown in panel d) can uncover deeper hidden patterns
such as small communities persistent over time (shown in blue).



This is useful in practice because most networks of interest, 
particularly communication networks, are time-varying. Typically,
temporal community detection enforces some continuity
or "smoothness" with the past community structure as the 
network evolves. While one could apply static community 
detection independently in each snapshot, this fails to dis- 
cover small, yet persistent communities because, without the 
smoothness constraints, these structures would be buried in 
noise and thus be unidentifiable to static methods. 

In this paper, we propose a principled framework that goes 
beyond regular temporal communities and incorporates the 
concept of overlapping temporal communities (OTC). In this 
formulation, nodes can belong to multiple communities at 
any given time and those communities can persist over time 
as well. This allows us to detect even more subtle persis- 
tent structure that would otherwise be subsumed into larger 
communities. We illustrate the various notions of community 
structure in Fig. 1. As we will demonstrate in Section V using
both synthetic and real-world data, our framework is able to 
correctly identify such community structure. 

Knowing the OTC structure of an evolving network is 
useful because, although the observed network may change
rapidly, its latent structure often evolves much more slowly.
For example, contact-based social networks might change 
from day to day due to people's varying daily activities, but 
the social groups (e.g. family, friends, colleagues) that people 
belong to are much more stable. Identifying such latent time- 
persistent structure can reveal the fundamental rules governing 
the seemingly chaotic evolution of real-life complex networks. 
In addition, knowledge of the times when significant changes 
occur could be used for predicting network evolution. 

The OTC structure of networks has many applications to 
designing communication networks as well. For example, it 
can be used in efficient distributed storage of information in 
a wireless network so that average access latency is minimized [6].
The OTC structure can also be used to select relay
nodes and design routing schemes in disruption-tolerant
wireless networks. Another application is to devise real-life
mobility models for analysis and evaluation of network
protocols. We elaborate on these applications in Section VI.



Our approach: We describe the key ideas behind our 
techniques. A naive approach to temporal community detec- 
tion is to perform static community detection independently 
in each snapshot. The limitations of this approach are well
documented [5], as it is very sensitive to even minor changes
in the network. The approach can be extended to detect OTC
structure as well but has the same limitations. Past work,
including [2] and [5], argues that temporal communities can be
detected if an explicit smoothness constraint that captures the 
distance to past partitions is enforced. With the smoothness 
constraint, it is possible to go beyond static methods and 
detect small persistent communities, as information at multiple 
snapshots is considered together. In this paper, we build on the 
above intuition and propose an approach for detecting OTC 
structure using temporal information. Our approach is a novel 
convex relaxation of the following combinatorial problem: find 
the temporal community structure that maximizes a quality 
function associated with each snapshot subject to a temporal 
smoothness constraint. To handle overlapping communities, 
we use generalizations of the quality function proposed in
[7] and the temporal smoothness measure proposed in [5].
While the quality function favors densely connected groups, 
the smoothness constraint promotes persistent structure. Our 
formulation is fairly general and allows other quality functions 
and smoothness metrics to be used. Further, it is naturally 
applicable to overlapping temporal communities and does 
not require any ad hoc modifications. Unlike most existing 
approaches that use greedy heuristics to solve the resulting NP- 
hard problem of optimizing over the combinatorial set of all 
partitions (or covers when overlaps are allowed), we use a tight 
convex relaxation of this set via the trace norm. This not only 
results in a convex optimization problem that can be solved 
efficiently using existing techniques, but also enables us to 
obtain a priori guarantees on the performance of our method, 
and provides valuable insight. In particular, our analysis shows 
that, under a natural generative model, our method is able to 
recover communities that persist for $m$ snapshots and have
size $K \gtrsim \sqrt{n/m}$, where $n$ is the number of nodes. This result
highlights the benefit of utilizing temporal information: with
more snapshots, we are able to detect smaller communities.



We believe this specific relation is novel, and applies beyond 
the particular methods in this paper. We note that our approach 
can detect non-overlapping temporal communities as well. 

To summarize, we provide the first principled formulation of 
the problem of detecting overlapping temporal communities. 
A critical piece of the formulation is the quality function for 
quantifying the community structure in any snapshot, and a 
distance function to ensure contiguity with the past community 
structure. To the best of our knowledge, we are the first ones 
to propose such functions for overlapping communities. We 
provide a convex relaxation and hence an efficient way to 
solve the optimization problem, while most existing work 
relies on greedy heuristics. In addition, we provide theoretical 
guarantees on the performance of our method, and we discuss 
the insights we gain from the guarantees. Finally, we evaluate 
our method using several synthetic and real network datasets
and illustrate its efficacy. We also discuss applications of our 
method to communication networks. 

Remark on terminology: In the sequel, we use cluster 
and community interchangeably, both allowing overlaps. 

II. Related Work 

There is a long line of work on community detection, which
has been comprehensively surveyed in [1]. Here, we focus on
work that is most relevant to our approach. In particular, [8]
presents a convex formulation for optimizing modularity [9],
a well-known quality function for static non-overlapping
communities. Our convex formulation is completely different, and
ours allows overlapping and temporal communities. Our formulation
is related to low-rank matrix recovery techniques [10],
[11]. This line of work typically uses the trace norm as a convex
surrogate of the non-convex rank function, and similarly the
$\ell_1$ norm for sparsity. In this paper, we use the trace norm as
a relaxation for the set of covers, a combinatorial and
non-convex set, and the (weighted) $\ell_1$ norm as the quality function.
A similar relaxation for static clustering without overlaps is
considered in [12]. Our approach for dealing with overlaps
is very different from existing work; for a survey we refer to
[13] and Section XI of [1]. In the rest of this section we focus
on an overview of temporal clustering.

Most existing work on temporal clustering can be divided
into two categories: 1) maximizing a quality function subject
to smoothness constraints; 2) slightly modifying the clustering
structure from the previous snapshot.

The first approach starts with [2], which proposes the 
framework of Evolutionary Clustering that aims to optimize a 
combination of a snapshot quality and a temporal smoothness 
cost. The work in [5] argues that a specific choice of the
temporal cost, namely estrangement, works well. In [5] the
authors use the KL-divergence in both the snapshot quality
and temporal cost, and reformulate the problem using
non-negative matrix factorization in order to obtain soft clusterings.
In [4] a particle-density based method is proposed to optimize
the clustering objective. All of these works use greedy
algorithms to solve the optimization problem, which only
guarantee convergence to a local optimum. Our formulation is
similar to the Evolutionary Clustering framework, but we are 
able to use convex optimization via a reasonable relaxation. 



The second class of methods typically works as follows:
each time the network changes, they modify the clustering
structure to reflect the change according to some predefined
rules. Smoothness is maintained since the modification would
not change the clustering too much. For example, [14] proposes
the iLCD (intrinsic Longitudinal Community Detection)
algorithm, which updates, merges, and creates communities
based on the previous clustering; overlap is allowed. [15]
adopts a similar idea, but does not allow a node to belong
to multiple clusters, while follow-up work [16] removes this
restriction. However, all of these works use update rules 
that are based on heuristics; some of them might produce 
duplicates or very small communities and need to use ad hoc 
procedures to remove them. Unlike our work, they do not 
provide any analytical guarantees. 

Other existing approaches include [17]-[19], which use
objective functions that essentially measure both snapshot
quality and temporal smoothness. Also, [20] proposes a method
to detect communities in multi-dimensional networks. None of
these, however, detects overlapping communities while simultaneously
tracking their evolution over time.

III. Formulation and Algorithm 

In this section, we formulate the problem and describe 
our algorithm. We consider the following natural formulation 
of OTC detection. Suppose we are given $T$ snapshots of a
network with $n$ nodes in terms of adjacency matrices $A^t$,
$t = 1, \ldots, T$ (see footnote 1). Our general goal is, at each time $t$, to
assign each node to a number of clusters so that the edge 
densities within clusters are higher than those across clusters, 
and that the assignment does not change rapidly with time. 
Note that each node might be assigned to multiple clusters, 
and clusters can overlap. A node might also be associated with 
no cluster; these nodes are called outliers, and are common 
in real networks. Mathematically, let $r^t$ be the total number
of communities at time $t$; the value of $r^t$ is, of course, not
known a priori. We would like to find $T$ covers with outliers,
where a cover $\mathcal{C}^t$ with outliers means a collection of $r^t$ subsets
$\mathcal{C}^t = \{C^t_k, k = 1, \ldots, r^t\}$ with $C^t_k \subseteq \{1, \ldots, n\}$; again note
that we allow outlier nodes that do not belong to any of the 
subsets. For convenience, we will use cover in the sequel when 
we actually mean cover with outliers. 

To make this formulation concrete, there are several ques- 
tions that need to be answered. 1) How to concisely represent 
a cover? 2) When overlaps are allowed, how to measure the 
quality of a cover? 3) In particular, how to avoid degenerate 
solutions? For example, declaring each edge as a cluster would 
make the in-cluster edge density 1 and across-cluster density 
0, but is an undesirable solution providing little information. 
Similarly, producing a cluster that differs from another only 
by one node hardly reveals any additional structure. 4) How 
to enforce temporal smoothness when overlap is present? 5) 
How to solve the resulting optimization problem over covers? 

In the remainder of this section, we present our precise 
approach and address the above questions. 

Footnote 1: We use the convention $A^t_{ii} = 1$.



A. Cover Matrix 

Our first step, and also a key to later development, is to 
adopt a matrix representation of a cover. We use the following
representation from [7].

Definition 1 (Matrix Representation of a Cover). A matrix
$Y \in \mathbb{R}^{n \times n}$ represents a cover $\mathcal{C} = \{C_k\}$ if
$Y_{ij} = |\{C \in \mathcal{C} : i \in C, j \in C\}|$. That is, $Y_{ij}$ equals the number of
clusters that include both node $i$ and node $j$.

Each cover has a unique matrix representation. To see this, 
let us introduce the notion of a cluster assignment matrix. 

Definition 2 (Cluster Assignment Matrix). $U \in \mathbb{R}^{n \times r}$ is the
cluster assignment matrix of a cover $\mathcal{C} = \{C_k, k = 1, \ldots, r\}$
if $U_{ik} = 1$ when $i \in C_k$ and zero otherwise.

The cluster assignment matrix $U$ is another representation
of a cover which shows the clusters that each node belongs
to. Clearly each cover corresponds to a unique $U$, and each $U$
corresponds to a unique $Y$ via the factorization $Y = UU^\top$ (the
$(i, j)$-th entry of the matrix $UU^\top$ is the inner product of the $i$-th
and $j$-th rows of $U$, which, due to the structure of $U$, equals
the number of shared clusters of nodes $i$ and $j$, i.e., $Y_{ij}$). In
the sequel we will mainly use $Y$ as the optimization variable,
but the factorization is useful later for post-processing.

Another way to view the cover matrix $Y$ is that it assigns
to each pair of nodes $(i, j)$ a "similarity level" $Y_{ij}$, measured
by the number of shared clusters between $i$ and $j$ [7]. When
there is no overlap, the assigned similarity level is either 1
($i$, $j$ assigned to the same cluster) or 0 (assigned to different
clusters). When overlaps are allowed, nodes sharing many
clusters are considered more similar. In contrast, the network
adjacency matrix $A$ can be viewed as the observed similarity
level. With this in mind, we can think of the general objective
of OTC detection as: find a series of covers $Y^t$ such that the
assigned similarity level is close to the observed one at each
$t$, and the covers change smoothly over time. In general the
number of clusters that include both nodes $i$ and $j$ might be
greater than 1, so the assigned similarity level can also exceed 1.
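As a minimal sketch of Definitions 1 and 2, the following NumPy snippet builds the assignment matrix $U$ and the cover matrix $Y = UU^\top$ for a small cover. The toy cover, the 0-based node labels, and the function names `assignment_matrix` and `cover_matrix` are illustrative assumptions, not part of the original method description.

```python
import numpy as np

def assignment_matrix(cover, n):
    """Return the n x r 0/1 matrix U with U[i, k] = 1 iff node i is in cluster k."""
    U = np.zeros((n, len(cover)), dtype=int)
    for k, cluster in enumerate(cover):
        U[list(cluster), k] = 1
    return U

def cover_matrix(cover, n):
    """Return Y = U U^T, so Y[i, j] counts the clusters containing both i and j."""
    U = assignment_matrix(cover, n)
    return U @ U.T

# Hypothetical toy cover on 6 nodes: two overlapping clusters; node 5 is an outlier.
toy_cover = [{0, 1, 2, 3}, {2, 3, 4}]
Y = cover_matrix(toy_cover, n=6)

print(Y[0, 1])  # nodes 0 and 1 share one cluster
print(Y[2, 3])  # nodes 2 and 3 share both clusters
print(Y[0, 4])  # nodes 0 and 4 share no cluster
```

In this toy case the overlapping nodes 2 and 3 receive similarity level 2, while the outlier's row and column of `Y` are entirely zero.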

B. Overlapping Temporal Community Detection 

We now give a precise formulation of the above general 
objective. We adopt an optimization-based approach to OTC 
Detection. In particular, we consider the following framework: 

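$$
\begin{aligned}
\max_{\{Y^t\}} \quad & \sum_{t=1}^{T} f_{A^t}(Y^t) \\
\text{s.t.} \quad & \sum_{t=1}^{T-1} d_{A^{t+1},A^t}(Y^{t+1}, Y^t) \le \delta, \\
& Y^t \text{ represents a cover}, \quad t = 1, \ldots, T.
\end{aligned}
\tag{1}
$$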

Here $f_{A^t}(Y^t)$ is the snapshot quality, which serves two
purposes: 1) it measures how well the cover $Y^t$ reflects the
network $A^t$, i.e., the closeness between the assigned similarity
level encoded in $Y^t$ and the observed similarity level in
$A^t$, and 2) it prevents the algorithm from over-fitting, e.g.,
generating duplicate communities or many small communities
that overlap with each other. The function $d_{A^{t+1},A^t}(Y^{t+1}, Y^t)$ is
a distance function that measures the difference between the
covers at time $t + 1$ and $t$. Consequently, the first constraint in
the above formulation ensures that the covers evolve smoothly
over time. This constraint prefers the evolutionary path with
fewer changes and reflects the inertia inherent in the evolution of
groups in real-life networks.

In this paper we focus on concave $f$ and convex $d$ (w.r.t.
$\{Y^t\}$). This covers many existing methods for clustering
without overlap. For example, $f$ can be the modularity function [9]
$f_A(Y) = \sum_{i,j} \left( A_{ij} - \frac{k_i k_j}{2M} \right) Y_{ij}$ (here $k_i$ is the degree of
node $i$ in $A$, $M$ is the total number of edges, and we ignore
the pre-constant) or the correlation clustering [21] objective
$f_A(Y) = -\|A - Y\|_1$ (here $\|X\|_1 = \sum_{i,j} |X_{ij}|$ is called the
matrix $\ell_1$ norm of $X$), and $d$ can be the estrangement [5]
$d_{A^{t+1},A^t}(Y^{t+1}, Y^t) = \sum_{i,j} A^{t+1}_{ij} A^t_{ij} \max(Y^t_{ij} - Y^{t+1}_{ij}, 0)$.

For OTC detection, the difficulty lies in defining quality and
distance functions that can handle overlaps. We propose two
novel metrics that are suitable for this task. For the snapshot
quality $f$, we use the weighted $\ell_1$ distance between the cover
matrix $Y$ and the adjacency matrix $A$:

$$f_A(Y) = -\sum_{i,j} \left| C_{ij} (Y_{ij} - A_{ij}) \right|,$$

where $C_{ij}$ are some weights to be chosen. In this paper, we use
the weights $C_{ij} = \left| A_{ij} - \frac{k_i k_j}{2M} \right|$, where $k_i$ and $M$ are defined
in the last paragraph. This quality function generalizes the
correlation clustering objective [21] and is closely related to
the widely used modularity quality function [9] when there is
no overlap. In particular, it penalizes three types of "errors"
(recall that $Y_{ij}$ is the number of clusters including both $i$ and $j$,
or the assigned similarity level between $i$ and $j$):

• $A_{ij} = 1$ and $Y_{ij} = 0$: nodes $i$ and $j$ are connected but
they are assigned to different clusters.

• $A_{ij} = 0$ and $Y_{ij} \ge 1$: nodes $i$ and $j$ are disconnected but
they share at least one cluster, i.e., the assigned similarity
level is positive while the observed one is zero.

• $A_{ij} = 1$ and $Y_{ij} > 1$: nodes $i$ and $j$ are connected
but they share more than one cluster, i.e., the assigned
similarity level is higher than the observed one.

Note that in the last two cases, the more clusters i and j share, 
the higher the cost is. This prevents the algorithm from over- 
fitting by generating many small clusters with lots of overlap. 
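The following sketch evaluates the weighted $\ell_1$ snapshot quality on a toy graph, to illustrate how it penalizes both under-clustering and duplicated clusters. It assumes that degrees and $2M$ are computed directly from the entries of $A$ (including the unit diagonal from the convention $A_{ii} = 1$); the toy graph and the names `quality_weights` and `snapshot_quality` are hypothetical.

```python
import numpy as np

def quality_weights(A):
    """Weights C_ij = |A_ij - k_i k_j / (2M)|, with k and 2M taken from A as given."""
    k = A.sum(axis=1)      # node degrees (row sums of A)
    two_M = A.sum()        # sum of all entries, used as 2M (assumption)
    return np.abs(A - np.outer(k, k) / two_M)

def snapshot_quality(A, Y):
    """Snapshot quality f_A(Y) = -sum_ij |C_ij (Y_ij - A_ij)|."""
    C = quality_weights(A)
    return -np.sum(C * np.abs(Y - A))

# Toy graph: two triangles {0,1,2} and {3,4,5} joined by the edge (2,3).
A = np.zeros((6, 6))
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    A[i, j] = A[j, i] = 1
np.fill_diagonal(A, 1)     # convention A_ii = 1

block = np.ones((3, 3))
Y_good = np.block([[block, np.zeros((3, 3))], [np.zeros((3, 3)), block]])
Y_one = np.ones((6, 6))    # everything merged into a single cluster
Y_dup = 2 * Y_good         # each triangle declared as a cluster twice

f_good = snapshot_quality(A, Y_good)
f_one = snapshot_quality(A, Y_one)
f_dup = snapshot_quality(A, Y_dup)
```

On this toy graph the two-triangle partition only pays for the single bridging edge, whereas the merged cover and the duplicated cover both score strictly lower.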

For the temporal distance $d$, we use:

$$d_{A^{t+1},A^t}(Y^{t+1}, Y^t) = \sum_{i,j} A^{t+1}_{ij} A^t_{ij} \left| Y^{t+1}_{ij} - Y^t_{ij} \right|.$$

In other words, we measure the change in the assigned
similarity level between nodes $i$ and $j$ (i.e., the number of
clusters that include both nodes), but only when there is an
edge between $i$ and $j$ in both snapshots $t$ and $t + 1$. For
non-overlapping clusters, this reduces to the number of persisting
edges that change "state" from intra-cluster to inter-cluster and
vice versa. Our measure is a modification and generalization
of the estrangement measure in [5] to overlapping clusters.
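A small sketch of this temporal distance, under the assumption that the sum runs over ordered pairs $(i, j)$, so that each undirected persisting edge is counted twice. The function name `temporal_distance` and the four-node example are illustrative only.

```python
import numpy as np

def temporal_distance(A_next, A_prev, Y_next, Y_prev):
    """Sum of |Y_next - Y_prev| over pairs connected in both snapshots."""
    persisting = A_next * A_prev          # 1 only where the edge exists at t and t+1
    return np.sum(persisting * np.abs(Y_next - Y_prev))

# Four fully connected nodes; one cluster splits into {0,1} and {2,3}.
A_prev = np.ones((4, 4))
A_next = np.ones((4, 4))
Y_prev = np.ones((4, 4))
Y_next = np.kron(np.eye(2), np.ones((2, 2)))

d_split = temporal_distance(A_next, A_prev, Y_next, Y_prev)

# If the edge (0,2) disappears at t+1, it no longer contributes.
A_drop = A_next.copy()
A_drop[0, 2] = A_drop[2, 0] = 0
d_drop = temporal_distance(A_drop, A_prev, Y_next, Y_prev)
```

Here the split changes the state of four persisting edges (eight ordered pairs); removing one of those edges from the later snapshot lowers the distance accordingly, which reflects that only persisting edges are penalized.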



C. Convex Relaxation 

The optimization problem (1) is combinatorial due to the
constraint "$Y^t$ represents a cover". Exhaustive search is
impossible because there are exponentially many possible covers.
One option is to use greedy local search, which is a popular
choice for optimizing modularity and other clustering
objectives, but it only converges to local optima and provides
no guarantees.

In this paper, we use convex optimization. There are two 
advantages of this approach: 1) it leads to an optimization 
problem that is efficiently solvable and guaranteed to converge 
to the global optimum, and 2) it is possible to obtain an a
priori characterization of the optimal solution (see Section IV),
which provides interesting insights into the problem. To this 
end, we relax the cover constraint and solve the following 
optimization problem: 



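$$
\begin{aligned}
\max_{\{Y^t\}} \quad & \sum_{t=1}^{T} f_{A^t}(Y^t) \\
\text{s.t.} \quad & \sum_{t=1}^{T-1} d_{A^{t+1},A^t}(Y^{t+1}, Y^t) \le \delta, \\
& \|Y^t\|_* \le B, \quad t = 1, \ldots, T.
\end{aligned}
\tag{2}
$$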

Here $\|Y^t\|_*$ is the so-called trace norm, the sum of singular
values of $Y^t$. It is known that the trace norm constraint
$\|Y^t\|_* \le B$ is a convex relaxation of the original cover
constraint [7]. We briefly explain the reason here. Recall that
a cover matrix admits the factorization $Y = UU^\top$, so a cover
matrix $Y$ is positive semidefinite and satisfies

$$\|Y\|_* = \sum_i Y_{ii} = \sum_i \#(\text{clusters that include node } i). \tag{3}$$

In particular, the right-hand side of (3) equals $n$ if $Y$ represents
a partition. Therefore, as long as $B$ is no smaller than the
right-hand side of (3), a cover matrix $Y$ also satisfies the
new constraint, which is thus a relaxation. Although the right-hand
side of (3) is unknown a priori, in practice we find that
choosing $B$ to be suitably large, such as $10n$ as is done in
our experiment section, works well. Moreover, the constraint
$\|Y^t\|_* \le B$ effectively imposes an upper bound on the amount
of overlap and prevents the algorithm from producing a large
number of clusters, which is desirable in its own right.
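Identity (3) can be checked numerically on a small example. The sketch below uses a hypothetical cover on six nodes and NumPy's nuclear norm; it only illustrates why a cover matrix satisfies the relaxed constraint whenever $B$ is at least the total number of memberships.

```python
import numpy as np

# Hypothetical cover on n = 6 nodes: nodes 2 and 3 overlap, node 5 is an outlier.
n = 6
cover = [[0, 1, 2, 3], [2, 3, 4]]

U = np.zeros((n, len(cover)))
for k, cluster in enumerate(cover):
    U[cluster, k] = 1.0
Y = U @ U.T

trace_norm = np.linalg.norm(Y, ord="nuc")  # sum of singular values of Y
memberships = U.sum()                      # sum over nodes of #(clusters containing it)
B = 10 * n                                 # the bound used in the experiments

print(trace_norm, np.trace(Y), memberships)
```

All three printed quantities agree (up to floating-point error) because $Y$ is positive semidefinite, and the value is far below $B = 10n$, so this cover matrix is feasible for the relaxed problem.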

The trace norm is known to be a good relaxation for partition
matrices both in theory and in practice [12], [22]. All partition
matrices with a small number of partitions (which is the
case of interest) are low-rank, and the trace norm is the tightest
convex relaxation of low-rank matrices in a formal sense [23].
Moreover, the trace norm utilizes the graph eigenspectrum, which
has long been known to reveal hidden clustering structure and
is the basis of the highly successful spectral clustering
methods. This advantage of the trace norm carries over to overlapping
clusters [7]. With this relaxation and our choice of $f$ and $d$, (2)
becomes a convex program and can be solved in polynomial
time using general-purpose convex optimization packages such
as SDPT3. In Appendix A we describe a specialized gradient
descent algorithm, which is even faster.



D. Post-processing

Ideally, the optimal solution $Y^t$ would represent a cover,
which could be easily extracted from $Y^t$ (e.g., by finding all
maximal cliques); in the next section we provide one sufficient
condition for this to happen. In practice, however, because of
the relaxation, $Y^t$ may not have the structure of a cover matrix.
But it is empirically observed that $Y^t$ is usually close to a
cover matrix; in particular, the optimization can be viewed as a
"denoising" procedure, which filters out most (though not all)
of the noise in the observation $A^t$ and makes the underlying
clustering structure clearer. Therefore, a good clustering
is likely to be extracted from $Y^t$ via simple post-processing.
We describe one such procedure below.

Recall again that a cover matrix can be factorized as
$Y = UU^\top$, where $U$ is an assignment matrix of non-negative
entries, with $U_{ik} = 1$ indicating that node $i$ is in cluster $k$. Therefore,
performing Non-negative Matrix Factorization (NMF) [24]
on a cover matrix $Y$ gives the corresponding cluster assignment.
When the optimal solution $Y$ is not a cover matrix but is close to
one, we expect that performing NMF $Y \approx UU^\top$ would still
produce an approximate assignment matrix $U$, which is then
rounded to an exact assignment matrix. In particular, we
declare node $i$ to belong to cluster $k$ at time $t$ if $U^t_{ik} > 0.5$.

E. Remarks on Our Method

Mapping communities: Practical applications sometimes
require the communities at time $t$ to be mapped to those
at $t - 1$, in order to track the evolution of communities.
In the experiment section, we use the mapping method in
[5], which still works when $Y$ is a cover instead of a partition.
The method involves mapping those communities across
consecutive snapshots that have the maximal mutual Jaccard
overlap between their constituent node sets, and generating
new community identifiers only when needed.

Online algorithm: In some cases it might be interesting
to use an online version of the algorithm (2): at each time $t$
when a new snapshot $A^t$ becomes available, we obtain a new
cover $Y^t$ by solving the following problem:

$$
\begin{aligned}
\max_{Y^t} \quad & f_{A^t}(Y^t) \\
\text{s.t.} \quad & d_{A^t,A^{t-1}}(Y^t, Y^{t-1}) \le \delta^t,
\end{aligned}
\tag{4}
$$

where $Y^{t-1}$ is the solution from the last snapshot $t - 1$ and
is considered fixed. Rigorously speaking, the solution to the
online formulation is in general different from that to the
offline one. But we expect that in practice the online formulation
will perform reasonably well, and various updating rules can
be adopted to choose the online upper bound $\delta^t$. We do not
delve into this in this paper.

Complexity and Scalability: Using the fast gradient descent
algorithm, the space and time complexities of our method
both scale linearly with the problem size (the numbers of
nodes, edges and snapshots); see Appendix B for details. With
the online implementation suggested above, the dependence on
the number of snapshots can be further alleviated. Our method is
therefore amenable to large datasets.

IV. Theoretical Analysis

In this section we provide theoretical analysis on the perfor- 
mance of our algorithm. In particular, our analysis shows that 
if the adjacency matrices $A^t$ are generated from an underlying
persistent partition according to a generative model, then
with high probability our method will recover the underlying
partition as long as $K = \Omega(\sqrt{n/m})$, where $K$ is the minimum
cluster size in the partition and $m$ is the number of snapshots it
persists for. This highlights the benefit of temporal clustering:
a small cluster of size $\sqrt{n/m}$ is likely to be undetectable
if each snapshot is considered individually (e.g., the cluster 
might not be connected in each single snapshot), but can be 
recovered by temporal clustering if the cluster persists for 
m snapshots and all snapshots are used. This result is quite 
revealing: traditional single-snapshot clustering techniques can 
only find clusters that are large in size, but temporal clustering 
is capable of detecting clusters that are small in size but large 
in the time axis. Moreover, our theorem predicts a specific 
tradeoff between the "spatial size" K and the "temporal size" 
$m$: with four times as many temporal snapshots, one can detect
a cluster that is half as large spatially. We believe this is the
first such result in the literature.
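The tradeoff can be made concrete with a short calculation. This sketch only evaluates the scaling $\sqrt{n/m}$ and ignores the unspecified constant hidden in the $\Omega(\cdot)$ notation, so the numbers indicate relative sizes rather than actual detection thresholds; the function name `size_scale` and the choice $n = 10{,}000$ are illustrative.

```python
import math

def size_scale(n, m):
    """sqrt(n / m): scaling of the smallest recoverable cluster size (constants omitted)."""
    return math.sqrt(n / m)

n = 10_000
scales = {m: size_scale(n, m) for m in (1, 4, 16)}
for m, k in scales.items():
    print(f"m = {m:2d} snapshots -> size scale {k:.0f}")
```

For $n = 10{,}000$, the scale drops from 100 to 50 to 25 as $m$ goes from 1 to 4 to 16: each fourfold increase in the number of snapshots halves the cluster size at which recovery is guaranteed.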

We now present our theorem. We use a generative model
which can be considered as a multi-snapshot version of the
classical and widely studied planted partition model (a.k.a.
stochastic block model) [25].



Definition 3 (Multi-Snapshot Planted Partition Model). Suppose
$n$ nodes are in $r$ disjoint clusters, each with size $K$,
and this clustering structure does not change over time (see
remarks after the theorem). Let $Y^*$ be the matrix that represents
this clustering. The adjacency matrices $A^t$, $t = 1, \ldots, m$,
are generated as follows: if nodes $i$ and $j$ are in different
clusters, then there is an edge between them (i.e., $A^t_{ij} = 1$)
with probability $q$, independently of all others; if they are in
the same cluster, then $A^t_{ij} = 1$ with probability $p$. We assume
$q < \frac{1}{2} < p$ are constants independent of $n$, $m$ and $K$.
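A minimal sampler for this generative model might look as follows. It assumes undirected snapshots with the unit diagonal from the convention $A^t_{ii} = 1$; the function names, the seed, and the parameter values are hypothetical. The helper `mixing_level` computes the quantity $s = pK/n + q(1 - K/n)$ that appears in the analysis of the model.

```python
import numpy as np

def sample_planted_snapshots(r, K, m, p, q, seed=0):
    """Draw m symmetric 0/1 snapshots from r disjoint clusters of size K."""
    rng = np.random.default_rng(seed)
    n = r * K
    labels = np.repeat(np.arange(r), K)
    Y_star = (labels[:, None] == labels[None, :]).astype(int)  # true partition matrix
    probs = np.where(Y_star == 1, p, q)
    snapshots = []
    for _ in range(m):
        # Sample each unordered pair once, then symmetrize.
        upper = np.triu(rng.random((n, n)) < probs, k=1).astype(int)
        snapshots.append(upper + upper.T + np.eye(n, dtype=int))  # A_ii = 1
    return np.stack(snapshots), Y_star

def mixing_level(p, q, K, n):
    """s = p K/n + q (1 - K/n), which always lies strictly between q and p."""
    return p * K / n + q * (1 - K / n)

snaps, Y_star = sample_planted_snapshots(r=3, K=4, m=5, p=0.7, q=0.2)
s = mixing_level(p=0.7, q=0.2, K=4, n=12)
```

Because the partition is fixed, all randomness comes from the edges, so stacking more snapshots gives more independent observations of the same `Y_star`.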

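Since the underlying partition does not change, we impose
the constraint $\sum_{t=1}^{m-1} d_{A^{t+1},A^t}(Y^{t+1}, Y^t) = 0$, which is
equivalent to $Y^t = Y$, $\forall t$. Rewritten in an equivalent minimization
form, our algorithm becomes

$$
\begin{aligned}
\min_{Y} \quad & \sum_{t=1}^{m} \sum_{i,j} \left| C^t_{ij} (Y_{ij} - A^t_{ij}) \right| \\
\text{s.t.} \quad & \|Y\|_* \le n.
\end{aligned}
\tag{5}
$$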

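Note that under the multi-snapshot planted partition model,
we have $C^t_{ij} = \left| A^t_{ij} - \frac{k_i k_j}{2M} \right| \approx \left| A^t_{ij} - s \right|$, where
$s = p \frac{K}{n} + q \left( 1 - \frac{K}{n} \right) \in (q, p)$. The following theorem characterizes when
(5) recovers the true underlying partition matrix $Y^*$.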

Theorem 1. Suppose $C^t_{ij} = |A^t_{ij} - s|$. Under the multi-snapshot
planted partition model, if $K = \Omega(\sqrt{n/m})$, then $Y^*$ is
the unique optimal solution to the convex optimization problem
(5) with probability converging to 1 as $n \to \infty$.

The proof is given in Appendix C.

Remark on Theorem 1: Although the multi-snapshot
planted partition model assumes that the underlying clustering
structure does not change, and that the clusters do not overlap,
we conjecture that similar theoretical guarantees can be obtained
with these restrictions removed. In particular, we expect that
our algorithm can detect clusters of size $\Theta(\sqrt{n/m})$ even if the
underlying structure changes, provided that between consecutive
changes there are at least $m$ snapshots. This conjecture is
supported by the experimental results in Section V.

V. Experimental Results 

We apply our method to two synthetic datasets and three
real-world datasets. Our synthetic networks are random graphs
generated according to an underlying community structure 
evolution. Each snapshot is an instantiation of a random graph 
generated by connecting each pair of nodes sharing at least 
one community with probability 0.5, and with probability 0.2 
otherwise (including the case where one or both of the nodes 
are not in any community). Note that we allow some nodes in 
some snapshots to not belong to any community, as is often 
true in real scenarios. Also, note that nodes sharing more than
one community are not connected with a higher probability.
This makes overlapping communities harder to detect and is 
a better test of the detection methods. 

Using this prescription, we generate two synthetic time- 
varying networks to validate our method and demonstrate its 
efficacy. We compare the results obtained with and without 
overlap allowed, and with and without the smoothness con- 
straint. A popular temporal clustering method using multi-slice 
modularity [19] is also considered.

The four real network datasets considered in this section 
include MANET, international trade, AS links, and the MIT 
Reality Mining Data. 

A. Synthetic Random Networks I 

In the first synthetic experiment, we demonstrate the advan- 
tage of considering the temporal aspect and allowing overlap, 
and that there is clustering structure that can be detected only 
if we consider both. We generate the network snapshots as 
follows. Suppose there are 120 nodes and 5 underlying
communities. Community 1 is a small 15-node group including
nodes 0 through 14. Communities 2 and 3, both of size 38,
consist of nodes 15-52 and 47-85, respectively, and overlap
at 5 nodes (47-52). Communities 4 and 5, both of smaller
size 20, include nodes 85-104 and 100-119, respectively, and
overlap at 5 nodes (100-104).

Since community 1 is small, in light of Theorem 1 we
expect that single-snapshot methods are unable to detect it due
to noise/randomness in the network, but temporal methods will
find it. Communities 2 and 3 are large but overlap with each
other, so only methods that allow overlap would detect them,
even if the snapshots are considered individually. Finally,
communities 4 and 5 are small and overlapping, and are thus
expected to be discoverable only when both the temporal and
overlap aspects are considered. This is indeed the case in our
experiments. The results are shown in Figs. 2 to 5. Visualizing
overlapping temporal communities is not a trivial task. Here
we extend the approach used in Fig. 1 to allow overlaps, which
is explained in the caption of Fig. 2. Fig. 2 shows the results of
our method, which detects all of the underlying structure.
Fig. 3 shows the result of our method but with S set to infinity,
so there is no temporal smoothness constraint and snapshots
are considered independently. In this case, communities 1, 4
and 5 are not recovered completely. Fig. 4 shows the result
when overlap is not allowed, i.e., we impose the constraint
$Y^t_{ij} \le 1, \forall t, i, j$. All overlapping structure is clearly lost.
Fig. 5 shows the result when $S = \infty$ and overlap is not allowed;
one can see a further degradation of performance.

Fig. 2. Synthetic experiment I: Overlapping Temporal Community structure
detected by our algorithm. In the figure, the strip between each pair of
consecutive vertical black lines shows the community assignment for one
snapshot. For each snapshot, there are a number of colored vertical strips,
each representing a community which contains nodes with corresponding
labels on the Y axis. For example, the leftmost blue strip represents a
community (Community 3) in the first snapshot containing nodes 47 to 85,
and the cyan strip to its right is a community (Community 2) containing
nodes 15 to 52; nodes 47 to 52 are in both communities. Across snapshots,
communities with the same color are those that are mapped to each other
using the mapping method described in Section III-D. This figure shows
that our method faithfully recovers the underlying 5-community structure
that is used to generate the network.

We also measure the performance of the above four methods
by computing the distance of the recovered community
structure from the ground truth. We use the distance
$\sum_{t} \|Y^{*t} - Y^{t}\|_1$, where $Y^{*t}$ denotes the cover matrix of the
ground truth, and $Y^{t}$ the one found by a clustering algorithm.
The results are given in the second row of Table I. The error
is an order of magnitude smaller when both the overlapping
and temporal aspects are considered.
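The distance above can be computed directly from the two sequences of cover matrices. The sketch below is only illustrative: the helper `cover_from_communities` assumes that entry (i, j) of a cover matrix counts the communities shared by nodes i and j, which is a convention adopted here for the example, and both function names are hypothetical.

```python
import numpy as np

def cover_from_communities(n, communities):
    """Build a cover matrix under the ASSUMED convention that entry (i, j)
    counts the communities containing both node i and node j."""
    M = np.zeros((n, len(communities)), dtype=int)
    for k, members in enumerate(communities):
        M[list(members), k] = 1
    return M @ M.T

def cover_distance(Y_true, Y_found):
    """Sum over snapshots of the entry-wise l1 distance between cover matrices."""
    if len(Y_true) != len(Y_found):
        raise ValueError("both sequences must cover the same snapshots")
    return sum(np.abs(Yt - Yf).sum() for Yt, Yf in zip(Y_true, Y_found))
```

A distance of zero means that every snapshot's recovered cover matrix coincides with the ground truth.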

Comparison with existing schemes: Although there has
been much work on community detection algorithms, almost
none allows simultaneously discovering overlapping and temporal
communities. Thus, we can only compare against some
representative non-overlapping temporal community detection
algorithms. We compare against the widely cited temporal
community detection scheme presented in [19]. This method
involves two parameters, the resolution $\gamma$ for the modularity
function and the inter-slice coupling strength $\omega$. Since the
ground truth clustering structure does not change over time, a
large $\omega$ is used to force a static output. We then search over
different values of $\gamma$ and use the one that gives the smallest
error. The recovered community structure is shown in Fig. 6.
We find that this method cannot identify the overlap structure
(as expected), and fails to recover the non-overlapping portions
of small communities (communities 4 and 5). The recovery
error, given in the last column of Table I, is also high.



Fig. 3. Synthetic experiment I: Clustering result when overlaps are allowed
but without temporal smoothness constraint. Overlaps between large
communities are detected, but small communities and their overlaps are
not well recovered.




Fig. 7. Clustering result of our method for synthetic experiment II
(horizontal axis: time, spanning Phases I to IV). Our method is able to
detect the merging, emergence, shrinking, splitting, and growth of
communities, as well as their overlaps.
Fig. 4. Synthetic experiment I: Clustering result with temporal smoothness
constraint but not allowing overlaps. The overlap structure is lost.

Fig. 8. Clustering result of the multi-slice modularity method in [19] for
synthetic experiment II.

Fig. 5. Synthetic experiment I: Clustering result without temporal smoothness
constraint and no overlaps. Small communities are not well recovered and the
overlap structure is lost.

Fig. 6. Synthetic experiment I: Clustering using the multi-slice modularity
method in [19]. The two small communities 4 and 5 are incorrectly identified
as one cluster, and the overlap structure is lost.

TABLE I 

Distance from ground truth for synthetic experiments.

B. Synthetic Random Networks II

The second synthetic experiment demonstrates the ability of
our method to detect and track time-varying cluster structure,
including the overlap, merger, emergence, splitting, growth,
and shrinking of communities. We describe how we generate
the snapshots. The network has 100 nodes, and the underlying
clustering structure has four phases, each with 10 snapshots:

• Phase I: There are two communities: community 1 includes
nodes 0 to 39, and community 2 includes nodes
40 to 79. The structure does not change during this phase.

• Phase II: Community 1 remains the same, but community
2, now with nodes 30 to 69, overlaps with community 1.
The structure does not change during this phase.

• Phase III: Communities 1 and 2 merge into a large
community A, which consists of nodes 0 to 69. This
community then gradually shrinks: at each time step
one node leaves, and at the end of this phase, community
A has nodes 0 to 60. Meanwhile, a new community B,
consisting of nodes 75-99, emerges at the beginning of
this phase and remains unchanged throughout this phase.

• Phase IV: Community A splits into two smaller ones
consisting of nodes 0-19 and 20-59, respectively. Community
B grows by absorbing nodes 60-74, and thus has nodes
60-99. The structure does not change in this phase.
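The four phases can be written down as a small ground-truth function. The following Python sketch is illustrative only and rests on several assumptions that are not stated in the text: nodes are labeled 0 to 99, snapshots are indexed from 0, the node leaving community A in Phase III is always its highest-numbered member, and community B in Phase IV is taken to be nodes 60 to 99 since the network has 100 nodes. The function name is hypothetical.

```python
def ground_truth_communities(t):
    """Return the list of communities (as ranges of node labels) at snapshot t.

    Assumes 40 snapshots indexed 0..39, ten per phase.
    """
    if not 0 <= t < 40:
        raise ValueError("snapshot index must be in 0..39")
    phase, step = divmod(t, 10)
    if phase == 0:   # Phase I: two disjoint communities
        return [range(0, 40), range(40, 80)]
    if phase == 1:   # Phase II: community 2 shifts and overlaps community 1
        return [range(0, 40), range(30, 70)]
    if phase == 2:   # Phase III: A shrinks from 0-69 to 0-60; B emerges
        return [range(0, 70 - step), range(75, 100)]
    # Phase IV: A splits into two parts; B has absorbed nodes 60-74
    return [range(0, 20), range(20, 60), range(60, 100)]
```

Such a function can be combined with any snapshot sampler that accepts a list of communities to regenerate a network with the same evolution pattern.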

As can be seen in Fig. 7, our method performs quite
well in recovering the underlying evolving structure. This
complements our theoretical results in Section IV and shows
that our method can handle overlaps and detect jumps in the
structure. We compare our method with those that do not allow
overlap, or ignore the temporal aspect; see Table I. Our method
again outperforms other methods by a large margin.

Comparison: We also compare with the algorithm in [19].
For this algorithm, we search for the best parameters $(\gamma, \omega)$
that give the smallest error. The recovered community structure
is shown in Fig. 8 and the error is given in the last column of
Table I. One observes that it cannot detect overlaps. Without
considering overlaps, our method is competitive with a
state-of-the-art algorithm that specializes in temporal clustering.



C. Real MANET Scenario 

We now present results on a real wireless network with 
mobility. The data is based on the mobility trace from an 
experiment scenario in New Jersey as described in [26]. We
use a 40-node version of the scenario where the nodes are
organized into three teams. The teams move from an initial 
point to a target point using two primary routes over a three- 
hour period. The scenario is divided into several phases, each 
associated with a rendezvous point. During each phase the 
teams move from one rendezvous point to the next and pause 
before moving on. There are also six leader nodes which have 
high range radios and are mostly in range of each other. 

The input to our algorithm is 711 network snapshots formed
by the wireless connections between the nodes. The physical
locations of the nodes, as well as the underlying team structure,
are unknown to the algorithm. The community structure found
by our algorithm is shown in Fig. 9. We find that the leader
nodes form a small yet persistent community (shown in orange
in Fig. 9), which can only be detected by our clustering
method. We also find that the overlapping temporal community
structure is essentially invariant within each phase of the
scenario, even when the topology as well as the instantaneous
community structure without overlap is changing. Thus, in this
case the overlapping temporal community structure detected by
our method reveals a structural pattern that remains invariant
even with a fair bit of mobility.

D. MIT Reality Mining

We apply our method to a human-human contact network
from the Reality Mining project [27]. The results are shown in
Fig. 11. Two predominant groups can be seen, one corresponding
to the staff at the MIT Media Lab, and the other
corresponding to the students at the MIT Sloan School of
Business. We also observe a discontinuity of the Sloan School
community around New Year's break.

E. International Trade Network 

Our next real dataset consists of annual trade volumes 
between pairs of countries during 1870-2006 [28]. We create 
an unweighted network each year by placing an edge between 
two countries if the trade volume between them exceeds 0.1% 
of the total trade volumes of both countries; in other words, 
an edge is drawn if their trade is significant for both of them. 
This leads to a dynamic network with 197 nodes and 137 
snapshots, which is fed to our algorithm. 
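The thresholding rule can be sketched as follows. This is an illustrative reading of the rule, not the original preprocessing code: it assumes one symmetric matrix of bilateral trade volumes per year and takes a country's total trade volume to be the sum of its row; the function name `trade_adjacency` is hypothetical.

```python
import numpy as np

def trade_adjacency(volume, threshold=0.001):
    """Unweighted yearly trade network from a symmetric volume matrix.

    volume[i, j] is the (assumed symmetric) trade volume between
    countries i and j in a given year.
    """
    volume = np.asarray(volume, dtype=float)
    total = volume.sum(axis=1)  # total trade volume of each country
    # The bilateral volume must exceed the threshold fraction of BOTH
    # totals, i.e. it must exceed the larger of the two limits.
    limit = threshold * np.maximum(total[:, None], total[None, :])
    A = (volume > limit).astype(int)
    np.fill_diagonal(A, 0)
    return A
```

Taking the maximum of the two limits encodes the requirement that the trade be significant for both partners rather than for only one of them.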

Fig. 12 shows the post-World War II (1950-2006) community
structure found by our algorithm, where the overlaps
are not displayed (for each node, only the largest cluster it
belongs to is shown). Five prominent trade communities can
be immediately identified: Latin American, US-Euro-Asian,
Ex-USSR Block, West African, and Afro-Asian. One also
observes the evolution of the communities, including the
formation of the West African block in 1960 ("the Year of
Africa") due to decolonization, the emergence of the Ex-USSR
block after 1991, as well as Colombia and Venezuela joining
the US-Euro-Asian Block in the 1970s.



Fig. 9. Clustering results for MANET data. Top panel: community structure
found by our method for all 711 snapshots, where the colors indicate the
community membership of each node at each time; for each node, only the
largest cluster it belongs to is shown; overlaps are not displayed. Middle panel:
overlapping community structure for the first 30 snapshots; 3 teams and the
six leader nodes are identified by our method; the six leader nodes form a
small yet persistent community that overlaps with the other communities;
this community cannot be detected if overlap is not allowed (compare with
Fig. 10). Bottom panel: the observed network structure at two snapshots;
at snapshot No. 1, all nodes are densely connected with each other and
form a single community; at snapshot No. 26, there are three communities
corresponding to the three teams; in addition, the six leader nodes form a
community of their own, which is not obvious from looking at a single
snapshot of the network, yet our method is able to detect it.




Fig. 10. Clustering results without overlaps for MANET data for the first 
30 snapshots. The community of leader nodes is not detected. 



More information can be obtained by examining the overlap
structure. A number of countries are associated with multiple
communities. For example, US, Mexico, Colombia and Brazil
belong to both US-Euro-Asian and Latin American blocks.
France and Portugal are in the US-Euro-Asian block, but they
both interact with the West African block for a significant
number of years. Similarly, Ivory Coast, Ghana and Nigeria
are mainly West African but also associated with the
US-Euro-Asian block. Several Asian/Pacific countries, including
Saudi Arabia and Australia, have trade partners in both
US-Euro-Asian and Afro-Asian blocks.



Fig. 11. MIT Data. No significant overlap is observed, so we only show
non-overlapping temporal community structure found. Two predominant groups
can be seen, one corresponding to the staff and students at the MIT Media
Lab, and the other corresponding to the students at the MIT Sloan School of
Business.

Fig. 12. Clustering result for the International Trade Network; only years
1950-2006 are shown (overlaps not shown for clarity). Five prominent trade
communities (blocks) can be seen. Moreover, one can observe the emergence
of the Ex-USSR block (after 1992), the West African Block (at 1960) and the
Afro-Asian Block (post-1970), as well as Colombia and Venezuela joining
the US-Euro-Asian block (1970s, orange arrow at the top-middle part of the
plot). Note that black is the background color and not a community.



F. The Skitter AS Links Dataset 

Finally, to validate the performance of our algorithm on
larger networks, we analyze the Internet topology at the
Autonomous System (AS) level as collected by CAIDA [29].
We obtained quarterly snapshots of the data over an 8-year
period starting in 2000. The data has up to 28000 nodes in some
snapshots. Many of those are edge nodes with a low degree
and do not belong to a community. Thus we only consider
nodes with degree larger than 9 in at least one snapshot. The
final dataset consists of 2807 nodes and 32 snapshots.
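The node-filtering step can be sketched as below. The sketch assumes that all snapshots are dense 0/1 adjacency arrays over a common node indexing and that degrees are measured before any filtering; for graphs with tens of thousands of nodes a sparse representation would be preferable. The function name `keep_active_nodes` is hypothetical.

```python
import numpy as np

def keep_active_nodes(snapshots, degree_above=9):
    """Keep nodes whose degree exceeds `degree_above` in at least one snapshot.

    Returns the kept node indices and the snapshots restricted to them.
    """
    # degrees[t, i] is the degree of node i in snapshot t.
    degrees = np.stack([np.asarray(A).sum(axis=1) for A in snapshots])
    keep = np.flatnonzero((degrees > degree_above).any(axis=0))
    # Restrict every snapshot to the induced subgraph on the kept nodes.
    return keep, [np.asarray(A)[np.ix_(keep, keep)] for A in snapshots]
```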

Among these 2807 ASes, we identify 90 that exhibit
significant community structure: each of them is assigned
to a community in at least 10 snapshots. The temporal community
structure for these nodes is shown in Fig. 14; overlaps
are not shown for clarity. Results with overlaps are shown in
Fig. 15.

We make some initial observations from Fig. 14. (1) In the
upper portion we see a persistent block with AS 1, 1239,
7018, 5511, 2914, 3561, 6461, 3549, 3356, 701, 209, and
6453. These seem to be mainly in the US. (2) In the lower-right
there is another smaller block with AS 8928, 286, 6695 and
13237, which seem to be in the EU and DE. (3) Between 2004 and
2005 there is some significant formation of new communities.
A similar phenomenon has been observed in [30]. Moreover,
by looking at the overlap structure, we find that all the
nodes in the US block mentioned above consistently appear
in multiple clusters. These turn out to be Tier 1 providers or
large internet exchange points.

Fig. 13. Clustering with overlaps for the International Trade Network; only
years 1950-2006 are shown. The figure indicates some of the countries that
are associated with multiple communities. Note that the color black is the
background and does not indicate a community.

Fig. 14. Clustering result for the AS Link Dataset, overlaps not shown.

VI. Applications to Communication Networks 

We now describe some applications of community detection 
to the design of communication networks. 




Fig. 15. Clustering with overlaps for the AS Link Dataset (horizontal axis: time).



Routing in Disruption Tolerant Networks (DTNs): DTNs
are often formed by devices that are carried around by humans
whose mobility patterns are strongly influenced by their social
relationships. Thus, the structures of the social graph between
the humans and the contact graph between the devices
are correlated. While the contact graph can change rapidly,
it usually possesses a relatively stable underlying structure
that is a function of the less volatile social graph [31]-[35].
This can be used to develop "social aware" routing strategies
that use social metrics such as node centrality and community
labels to make forwarding decisions [31]-[34], [36]. All of
these schemes utilize some form of community detection on
the contact graph to infer social relations between the mobile
nodes. However, the community detection methods used are
generally limited to the non-overlapping and even the
non-time-varying case. The community detection framework proposed
in this paper can be used with any of these schemes while
overcoming these limitations. This can result in significant
performance gains when, for example, people belong to multiple
social groups (e.g., friends, family, co-workers, etc.).

Efficient Caching in Content Centric Networks: Content
based networking is an emerging paradigm that does not
require connection-oriented protocols between producers and
consumers of information in communication networks.
Intelligent caching and replication of the content can significantly
reduce access delays as well as the overhead costs associated
with repeated querying and duplicate transmissions. Recent
work [6] proposes making use of the community structure of a
MANET to determine nodes for content replication. Assuming
that the community structure changes on a slower time scale 
than the network topology, nodes in the same community can 
cooperate to provide an efficient and speedy access to content. 
The method proposed in this paper can provide a principled 
approach to build distributed content caching protocols. 

Developing Realistic Mobility Models: Much initial work
on the design and analysis of routing algorithms for mobile
networks assumed simplistic mobility models such as random
walk, random waypoint, etc. However, the analysis of mobility
traces from many real-life scenarios suggests that these
simplistic models do not capture the details of real-world mobility
characteristics such as periodicity and correlations due to
social relationships between nodes. Recent work on mobility
modeling [37], [38] attempts to capture the dependence of
mobility patterns on the social relationships between nodes.
Community detection methods such as ours can be used to
construct more refined mobility models that capture complex
features such as the existence of overlapped communities as
well as small yet persistent temporal communities.

VII. Conclusion 

In this paper, we consider the problem of detecting overlap- 
ping temporal communities in dynamic networks. A convex 
optimization based approach is proposed for this problem. 
Theoretical and experimental results show that our method 
is capable of revealing interesting community structure that 
cannot be detected by methods that do not allow overlap, or 
those that do not utilize temporal information. For simplicity, 
in this work we have focused on unweighted graphs. In 
the future, we plan to extend our method to treat weighted 
graphs as well as develop distributed versions of the algorithm. 
We believe our methods have wide applications in studying 
the structure and evolution of complex networked systems 
including communication networks and social networks. 
Appendix



A. Fast Algorithm 

Solving the program (2) using a standard package is feasible only
for small or medium-size problems. In this section, we describe a
faster algorithm that is suitable for larger-scale datasets. Our method
is based on matrix factorization. Each positive semidefinite cover
matrix $Y^t$ can be factorized as $Y^t = U^t U^{t\top}$, where $U^t \in \mathbb{R}^{n \times r}$
and $\|Y^t\|_* = \|U^t\|_F^2$. Here $r$ is any upper bound on the number of
clusters at each snapshot; one can always use $r = n$, but a smaller
value is more desirable. We consider the Lagrangian of the original
constrained formulation. The optimization problem becomes

$$\max_{U} \; \sum_{t=1}^{T} f(U^t U^{t\top} \mid A^t) - \gamma \sum_{t=1}^{T-1} d(Y^{t+1}, Y^t), \quad \text{s.t. } \|U^t\|_F^2 \le B. \qquad (6)$$

Choosing the multiplier $\gamma$ is equivalent to choosing $S$ in the original
formulation. We use (sub-)gradient descent to solve the problem:

$$U^t \leftarrow \Pi_B\Big[ U^t + \tau^k \big( \nabla f(U^t U^{t\top}) - \gamma \nabla_2 d(U^{t+1} U^{t+1\top}, U^t U^{t\top}) - \gamma \nabla_1 d(U^t U^{t\top}, U^{t-1} U^{t-1\top}) \big) U^t \Big], \qquad (7)$$

where $\nabla$ is the sub-gradient operator, $\nabla_i$ denotes the (sub-)gradient
w.r.t. the $i$-th argument, $\Pi_B(X)$ is the Euclidean projection of $X$
onto the Frobenius norm ball $\{Z : \|Z\|_F \le \sqrt{B}\}$ (i.e., scale down
$X$ to have Frobenius norm $\sqrt{B}$ if and only if it is outside the ball),
and $\tau^k$ is the step size at iteration $k$. As for all gradient descent
methods, the above procedure is guaranteed to converge provided
$\tau^k \to 0$. In this paper, we use a geometrically decreasing step size
$\tau^k = 0.001 \cdot 0.995^k$.
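The projection and the step-size schedule are simple enough to sketch in a few lines of Python. The code below is a minimal illustration rather than the implementation used for the experiments: the quality function and the smoothness distance are defined elsewhere in the paper, so the combined (sub-)gradient matrix `G_t` is assumed to be supplied by the caller, and all function names are hypothetical.

```python
import numpy as np

def project_frobenius_ball(X, B):
    """Euclidean projection onto {Z : ||Z||_F <= sqrt(B)}.

    X is rescaled only if it lies outside the ball.
    """
    norm = np.linalg.norm(X, "fro")
    radius = np.sqrt(B)
    return X if norm <= radius else X * (radius / norm)

def step_size(k, tau0=0.001, decay=0.995):
    """Geometrically decreasing step size tau0 * decay**k."""
    return tau0 * decay ** k

def update_factor(U_t, G_t, k, B):
    """One projected (sub-)gradient step for the factor of a single snapshot.

    G_t is ASSUMED to be the n-by-n combination of (sub-)gradients with
    respect to the cover matrix of this snapshot, computed by the caller.
    """
    return project_frobenius_ball(U_t + step_size(k) * (G_t @ U_t), B)
```

Keeping the gradient computation outside `update_factor` lets the same step be reused for different choices of quality function and smoothness distance.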

B. Complexity and Scalability 

We analyze the memory and time complexity of the gradient
descent algorithm.

Memory complexity: We need to store the adjacency matrices
$\{A^t\}$ and the factorization $\{U^t\}$, which requires $O(E)$ and $O(nrT)$
memory, respectively; here $E$ is the total number of edges in all
snapshots, $n$ the number of nodes, $r$ the maximum number of clusters
at each snapshot, and $T$ the number of snapshots. The total memory
complexity is $O(E + nrT)$. The online implementation suggested in
Section III-E will further alleviate the dependence on $T$.

Time complexity: The algorithm requires time $T_1$ for computing
an initial point, and $T_2$ for each of the $M$ iterations. Here we
initialize $U^t$ by taking a rank-$r$ SVD of $A^t$. For each $t$ this can be
done in time $O(rE_t + nr^2)$ (see [39]), where $E_t$ is the number of
edges in snapshot $t$. So $T_1 = O(rE + nr^2 T)$. Now consider the
update (7). The computation of the product of the three (sub-)gradient
operators with $U^t$ takes time $O(r^2 E_t + nr)$, $O(r^2 E_t)$ and $O(r^2 E_t)$,
respectively, by taking advantage of the fact that we can use any
sub-gradient. The summation and the projection both take $O(nr)$.
We thus have $T_2 = O(r^2 E + nrT)$. The total time complexity is
then $O(nr^2 T + Mr^2 E + MnrT)$. Characterizing the number of
iterations $M$ needed for a specified accuracy rigorously is difficult.
However, as observed empirically in our simulations and many other
studies, $M$ is independent of $E$, $n$ and $T$, and can be treated as $O(1)$.

In summary, with a bounded number of clusters $r$, both the space
and time complexity scale linearly in $E$ and $nT$. This is the best one
can hope for, as it takes at least this much space and time to read
the input and write down the final solution.

C. Proof of Theorem 1

The following lemma shows that it suffices to study the Lagrangian
formulation. Recall that $\|X\|_1 = \sum_{i,j} |X_{ij}|$ is the matrix $\ell_1$ norm
of $X$. Let $\circ$ denote the entry-wise product.

Lemma 1. $Y^*$ is the unique optimal solution to (5) if there exists a $\lambda$
such that $Y^*$ is the unique optimal solution to the following problem:

$$\min_{Y} \; \|Y\|_* + \lambda \sum_t \|C \circ (Y - A^t)\|_1. \qquad (8)$$

Proof: Let $g(Y) = \|Y\|_*$ and $h(Y) = \sum_t \|C \circ (Y - A^t)\|_1$.
Note that $g(Y^*) = n$. By standard convex analysis and the fact that
$Y^*$ is optimal to (8), we have the following chain of inequalities,
where (5) also denotes the optimal value of problem (5):

$$(5) = \min_{Y:\, g(Y) \le n} h(Y) = \max_{\lambda'} \min_{Y} \Big[ h(Y) + \frac{1}{\lambda'} (g(Y) - n) \Big] \ge \frac{1}{\lambda} \min_{Y} \big[ \lambda h(Y) + (g(Y) - n) \big] = h(Y^*) \ge (5).$$

Therefore, equality holds above, which proves that $Y^*$ is an optimal
solution to (5). We prove uniqueness by contradiction. If $Y^*$ is not the
unique optimal solution to (5), then there exists $Y' \ne Y^*$ with $g(Y') \le n$
and $h(Y') = (5)$. Using the equality we just proved, we have

$$h(Y') + \frac{1}{\lambda} (g(Y') - n) \le h(Y') = (5) = \frac{1}{\lambda} \min_{Y} \big[ \lambda h(Y) + (g(Y) - n) \big],$$

which contradicts the assumption that $Y^*$ is the unique optimal
solution to (8). ■

To prove Theorem 1, it suffices to show that $Y^*$ is the unique
optimal solution to {8} with A = ^ 16 S mn - We do this by showing 
that any other solution $Y^* + \Delta$ with $\Delta \ne 0$ has a higher objective
value.

We define a matrix $W$ which serves as a dual certificate. Let $S^t =
A^t - Y^*$, $\Omega_t = \{(i,j) : S^t_{ij} \ne 0\}$, $R = \{(i,j) : Y^*_{ij} = 1\}$, and $U$ be
the matrix whose columns are the singular vectors of $Y^*$. For any
entry set $\Omega \subseteq [n] \times [n]$, let $1_\Omega$ denote the matrix whose entries in $\Omega$
equal 1 and others 0. Define $W = \sum_t V^t + \sum_t Z^t$, where



-Pn t UU T + - 



2A \C o S* + 



PncUU ] 
p * 

: E a 

(2,j)ei?no^ 



S )i 



(<J) 



E 



si 



(i,j)eR c nnc 



Due to the randomness in Q*, both ^ V* and are random 

matrices with independent zero-mean entries, whose variances are 
bounded by j^k^ an d 4A 2 ra due to the setup of the model. Under 
our choice of A and the assumption of the theorem, they are further 
bounded by Let || • || be the spectral norm (the largest singular 
value). Standard bounds on the spectral norm of random matrices 
guarantees that with probability converging to one, 



||P T xW|| < 



E^ 



+ 



< 1. 



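It follows that $UU^\top + P_{T^\perp} W$ is a subgradient of $\|\cdot\|_*$ at $Y^*$, which
means $\|Y^* + \Delta\|_* - \|Y^*\|_* \ge \langle \Delta, UU^\top + P_{T^\perp} W \rangle$ for all
$\Delta$. Also define $F^t = -\mathrm{sign}(P_{\Omega_t^c}(\Delta))$, where $\mathrm{sign}(\cdot)$ is the signum
function, so $\langle F^t, \Delta \rangle = -\|P_{\Omega_t^c} \Delta\|_1$. We also know $C \circ (S^t + F^t)$
is a subgradient of $\|C \circ S^t\|_1$, so $\|C \circ (S^t - \Delta)\|_1 - \|C \circ S^t\|_1 \ge
\langle C \circ (S^t + F^t), -\Delta \rangle$. Combining the above discussion, we have

$$\|Y^* + \Delta\|_* - \|Y^*\|_* + \lambda \sum_t \big( \|C \circ (S^t - \Delta)\|_1 - \|C \circ S^t\|_1 \big) \ge \big\langle UU^\top + P_{T^\perp} W, \Delta \big\rangle + \lambda \sum_t \big\langle C \circ (S^t + F^t), -\Delta \big\rangle.$$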

We bound each of the above two terms. Notice that 

(UU T + P T ±W, A\ = (UU T + W,A^ — (P T W, A) 

= ({Ptit + Pni)(^UU T + V t + Z t ),A^-(P T W,A) 
>2A^||Pn ( (C7oA)|| 1 -||P T ^|L||A|| 1 



+E 



1 1 

ml — q 



Pn-UU T + 2A- — ? 



E ( 1 " s ) 1 (^) 



(i,j)ei?nn^ 



-2A- 



1-9 



E -i 



(i,i)Gi? c nfif 



here ||M||oo 
assumption of the theorem, we have 



norm. Under the 



^ 1 

m l — q 



and 



-2A- 



1-9 



E A ) ^ - J A II o A)^ . 



Moreover, observe that each entry of (PtW)^ = X^fcLi Wij> 
which is the sum of independent random variables with bounded 
variance as previously discussed. Under the assumption of the the- 
orem, this sum is bounded by < |raAmin{s, 1 — s} 
with probability converging to one by standard Bernstein inequality; 
||PtVK||oo is bounded by the same quantity using a union bound. It 
follows that 

(UU T + P T ^W,A) > 7 -\^\\Pn t (C o A^-^X^WP^ioA)^ 

t t 

On the other hand, we have 

A^(Co(S i+ fi),-A) 

t 

= -AEn^( C7 ° A )iii + A Ell p «f( C7oA )lli- 

t t 
Combining pieces, we obtain 

||y*+A||. - ||y||. +A^(||Co(5 t - A)^ - \\CoS\)) 



> (UU T + P T± W,A) - \\P n t(C o A)!^ +\J2\\P^(C°A)\\ 1 

t t 

>|A^||P n *(CoA)|| 1 -^A^||P nt c(CoA)|| 1 

t t 

-A^||P fit (CoA)|| 1 +A^||P nt c(CoA)|| 1 

t t 

>0. 

This completes the proof of the theorem. 



